Background

Sample covariance

Sample covariance between variables \(x=x_t\) and \(y=y_t\):

\[ c_{xy} = \frac{1}{N} \sum_i^N (x_i-\overline{x}) (y_i-\overline{y}) \]

Sample cross-covariance function for positive values of lag between variables \(x_t\) and \(y_{t+k}\) (Chatfield, The Analysis of Time Series, 2004):

\[ c_{xy}(k) = \frac{1}{N} \sum_{t=1}^{N-k} (x_t-\overline{x})(y_{t+k}-\overline{y}) \]

Sample correlation

Pearson’s correlation coefficient (sample correlation) is defined as the covariance of two variables divided by the product of their standard deviations (which are the square roots of their respective variances):

\[ r_{xy} = \frac{c_{xy}}{\sqrt{c_{xx}c_{yy}}} %= \frac{\sum (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{ \sum (x_i-\overline{x})^2 \sum (y_i-\overline{y})^2 }} \]

The sample cross-correlation function: \[ r_{xy}(k) = \frac{c_{xy}(k)}{\sqrt{c_{xx}(0)c_{yy}(0)}} \]

\(c_{xx}\) and \(c_{yy}\) are the sample variances of \(x\) and \(y\), respectively.

\(c_{xx}(0)\) and \(c_{yy}(0)\) that are the sample variances of \(x_t\) and \(y_t\) respectively.

Correlation coefficients illustrated

“For descriptive purposes, the relationship will be described as strong if \(|r| \geq .8\), moderate if \(.5 < |r| <.8\), and weak if \(|r| \leq .5\).” – Devore and Berk, Modern Mathematical Statistics with Applications, 2012

Devore and Berk, _Modern Mathematical Statistics with Applications_, 2012, Figure 12.24

Anscombe’s quartet classically illustrates the pitfalls on relying on a single coefficient – always visualize your data. Consider the following four datasets:

All have similar statistical properties.

set mean of x mean of y variance of x variance of y correlation intercept slope
1 9 7.5 11 4.127 0.82 3 0.5
2 9 7.5 11 4.128 0.82 3 0.5
3 9 7.5 11 4.123 0.82 3 0.5
4 9 7.5 11 4.123 0.82 3 0.5

Cross (lagged) correlations illustrated

Lag for 20.07.2013 in Lausanne (example for illustrative purposes):

lag_example

  • Radiation initiates the photochemistry for ozone formation and is strongly linked to the concentration of atmospheric oxidants.
  • Increase in radiation also leads to (delayed) increase in temperatures, which accelerate rates of reaction.
  • Emissions of NO suppresses net production of ozone early in the day.
  • VOC precursors also contribute to ozone formation.

A relationship between radiation intensity and ozone concentration may be better described by examining correlations between the daily maximum values of the two variables.

Confounding interpretations

A correlation or lagged correlation in x and y may also be observed. For examples, a correlation between \(x\) and \(y\) may not be due to the causal relationship between \(x\) and \(y\), but dependent on a third variable, \(z\). This is written:

\[ \begin{aligned} y_t &= f_y(z_t)\\ x_t &= f_x(z_t) \end{aligned} \]

\[ \begin{aligned} y_{t+k} &= f_y(z_t)\\ x_{t} &= f_x(z_t) \end{aligned} \]

Correlation does not imply causation”:

  • The variation between two variables may depend on a third, as shown above.
  • Even if the variation in one variable is dependent on another, the correlation does not indicate the direction of causality.

For entertainment purposes, you may wish to visit the website of Spurious correlations.

R demonstration

library(tidyverse)
library(chron)
source("functions_extra.R")
Sys.setlocale("LC_TIME","C")
options(stringsAsFactors=FALSE)
options(chron.year.abb=FALSE)
theme_set(theme_bw()) # just my preference for plots

Let us load data saved from Lesson 4.

df <- readRDS("data/2013/lau-zue.rds")
variables <- c("O3", "NO2", "CO", "PM10", "SO2", "NMVOC", "EC", "TEMP", "PREC", "RAD")

Pairwise correlations

Before computing correlations, we will first visualize the relationships using scatter plots (or x-y plots).

Let us calculate the daily maximum values for each variable:

lf <- gather(df, variable, value, all_of(variables))
daily.max <- lf %>%
  group_by(site, year, month, day, season, variable) %>%
  summarize(value=max(value, na.rm=TRUE)) %>%
  spread(variable, value)

Recall the relationship between temperature and O3 shown in a previous lecture. Note the seasonal dependence of this relationship.

ggplot(daily.max)+
  facet_grid(site~season)+
  geom_point(aes(TEMP, O3))

The corresponding correlation values can be obtained with the built-in function, cor (see ?cor). Note use ofuse=“pairwise.complete.obs”` for calculation of correlation with missing values.

(cor.values <- daily.max %>% group_by(site, season) %>%
   summarize(correlation=cor(TEMP, O3, use="pairwise.complete.obs")))
## # A tibble: 8 x 3
## # Groups:   site [2]
##   site  season correlation
##   <chr> <fct>        <dbl>
## 1 LAU   DJF         0.0760
## 2 LAU   MAM         0.534 
## 3 LAU   JJA         0.802 
## 4 LAU   SON         0.527 
## 5 ZUE   DJF         0.180 
## 6 ZUE   MAM         0.630 
## 7 ZUE   JJA         0.843 
## 8 ZUE   SON         0.648

We can format this table and save to file:

(cor.values.wf <- spread(cor.values, site, correlation))
## # A tibble: 4 x 3
##   season    LAU   ZUE
##   <fct>   <dbl> <dbl>
## 1 DJF    0.0760 0.180
## 2 MAM    0.534  0.630
## 3 JJA    0.802  0.843
## 4 SON    0.527  0.648
write.csv2(cor.values.wf, "correlations.csv", row.names=FALSE)

write.csv2 writes to a CSV (comma separated value) file using the European convention of defining delimiters as semicolons (;) rather than commas (,).

Or, we can visualize the values:

ggplot(cor.values)+
  geom_col(aes(season, correlation))+
  scale_y_continuous(limits=c(0,1))+
  facet_grid(.~site)+
  scale_y_continuous(expand=expansion(mult=c(0, 0.1)))

We can examine the correlation of other pollutants or meterological variables with O3.

gather(daily.max, variable, value, all_of(setdiff(variables, "O3"))) %>% # everything but ozone
  ggplot+
  facet_grid(site~variable, scale="free_x")+
  geom_point(aes(value, O3, group=season, color=season), shape=4)

We will next look at a scatter plot between hourly measurements of CO and NO2. Why does this relationship exist?

ggplot(df)+
  facet_grid(site~season)+
  geom_point(aes(CO, NO2))

For the following scatterplot matrix, we will use a built-in library called lattice which is a plotting system that exists in parallel to R’s base graphics and ggplot2 package (you can check out ggpairs from the GGally package for a ggplot2-based solution). We additionally define a function to include correlation coefficients in our panels.

library(lattice)

CorrelationValue <- function(x, y, ...) {
  correlation <- cor(x, y, use="pairwise.complete.obs") 
  if(is.finite(correlation)) {
    cpl <- current.panel.limits()
    panel.text(mean(cpl$xlim), mean(cpl$ylim),
               bquote(italic(r)==.(sprintf("%.2f", correlation))),
               adj=c(0.5,0.5), col="blue")
  }
}

Plot daily maximum values as pairwise points (only Lausanne) using this syntax:

daily.max <- as.data.frame(daily.max)
ix <- "LAU" == daily.max[["site"]]
splom(~daily.max[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | daily.max[ix,"season"],
      upper.panel = CorrelationValue,
      pch=4)

Plot hourly data as pairwise points. Given the large number of points, we can “smooth” the visual representation.

ix <- grepl("LAU", df[["site"]], fixed=TRUE)
splom(~df[ix,c("O3","NO2","CO","PM10","TEMP","PREC","RAD")] | df[ix,"season"],
      upper.panel = CorrelationValue,
      panel = panel.smoothScatter,
      pch=4, cex=.2, col="gray")

Notice some values of correlation went up.

Lagged correlations

We can define a function to generate lagged pairs for \(k\) intervals and apply it using the do() function. If we provide hourly measurements, \(k=1\) corresponds to a lag of 1 hour, and so on.

Lag <- function(pair, k) {
  out <- data.frame(lag=k, head(pair[,1],-k), tail(pair[,2],-k))
  names(out)[2:3] <- colnames(pair)
  out
}

lagged <- df %>%
  group_by(site, season) %>%
    do(rbind(Lag(.[,c("RAD","O3")], 1),
             Lag(.[,c("RAD","O3")], 2),
             Lag(.[,c("RAD","O3")], 3),
             Lag(.[,c("RAD","O3")], 4),
             Lag(.[,c("RAD","O3")], 5),
             Lag(.[,c("RAD","O3")], 6)))
ggplot(lagged) +
  geom_point(aes(RAD, O3, group=site, color=site), shape=4)+
    facet_grid(lag~season)

R includes a function, ccf, for calculating the cross correlation. The value passed to the na.action=na.pass says to compute covariances from the complete cases when there are missing values. Calling this function will generate a corresponding plot by default (which can be turned off). See ?ccf for further details.

ix <- "LAU"==df[["site"]] & "JJA"==df[["season"]]
ccf(df[ix,"RAD"], df[ix,"O3"], lag.max=6, na.action=na.pass)

This tells us that the correlation is higest at a lag of -4 hours.

We can wrap this in a function that returns information that we want in a data frame (note plot=FALSE argument):

LaggedCorrelation <- function(pair, ...) {
  out <- ccf(pair[,1], pair[,2], ..., na.action=na.pass, plot=FALSE)
  data.frame(lag=out[["lag"]], value=out[["acf"]])
}

lagged.values <- df %>% group_by(site, season) %>%
  do(LaggedCorrelation(.[,c("RAD","O3")], lag.max=6))

Note the syntax, ... in the function definition This passes on any additional arguments (lag.max in this case) from the outer function (LaggedCorrelation) to the inner-most function (ccf). ?... will refer you to the Introduction to R manual for further details on this syntax.

Plot:

ggplot(lagged.values)+
  geom_segment(aes(x=lag,xend=lag,y=0,yend=value))+
  facet_grid(site~season)+
  xlab("lag (hours)")+
  ylab("Cross correlation coefficient")